1 Introduction

          This report references and analyzes a dataset of daily temperature data compiled by the National Climatic Data Center for 321 cities from around the world running from 1995 to March 2020. The dataset consists of more than 2.9 million total rows. Variables include: region, country, city, day, month, year and daily average temperature in Fahrenheit. There are issues with missing values in this dataset, and an imbalanced frequency of observations by region which makes it difficult to complete meaningful statistical comparisons by regional grouping. This report documents the process of isolating intact subsets of the data for initial analysis, and identifies trends and relevant research questions.
          Temperature is a powerful indicator of large-scale climate change. However, variability in temperature can mask trends over time, weakening a linear fit. A hypothesis: seasonal variation has a meaningful impact on the linear model for monthly temperature. This report examines this question for a subset of New Zealand temperature data, summarized by month. It attempts to validate the effect of seasonality by comparing linear models with and without seasonality incorporated, using ANOVA. Correlation between imputed glacial ice volume data and temperature is also explored. Is there a relationship between increasing temperatures and glacial ice retreat?

2 Read in Data

city_temp_clean<- read.csv("city_temperature.csv")

# clean


city_temp_clean <- city_temp_clean %>%
  filter(Year!=200, Year!=201, Day!=0 , Year!=2020, State!="Washington DC")

#Year 200, year 201, Day 0 are all bad data.

#Year 2000 is incomplete (only thru May)
city_temp_clean$AvgTemperature[city_temp_clean$AvgTemperature == -99] <- NA

#creating a parsedDate column in order to use liquidate operations
city_temp_clean$parsedDate <- ymd(paste0(city_temp_clean$Year,"-",city_temp_clean$Month,"-",city_temp_clean$Day))

#Add a Celsius column with one decimal

city_temp_clean$AvgTempCelsius <- ((city_temp_clean$AvgTemperature-32)*(5/9))
city_temp_clean$AvgTempCelsius <- sapply(city_temp_clean$AvgTempCelsius, round,digits=1)


#Creating my character vector of cities with complete count of records.
full_record_cities <- city_temp_clean %>%
  group_by(City) %>%
  summarise(Annual_Avg=mean(AvgTempCelsius, na.rm=TRUE), Record_count=n()) %>% filter(Record_count>=9131) %>% arrange(-Record_count)%>% dplyr::select(City)

#Filtering the dataframe by complete records:
city_temp_clean <- city_temp_clean %>%
  filter(is.element(City, full_record_cities$City))

#Creating a character vector of Cities with fewer than 91 NAs (approximately 1% of the full record) , and filtering by that vector.

good_cities <- city_temp_clean %>%
  group_by(City) %>%
  summarise(nas_present = sum(is.na(AvgTempCelsius))) %>%
  filter(nas_present<=91)%>%
  dplyr::select(City)

city_temp_clean <- city_temp_clean %>%
  filter(is.element(City, good_cities$City))

city_list <- city_temp_clean%>%  
  group_by(City)%>%
  summarise(nas_present = sum(is.na(AvgTempCelsius)),record_count = n())%>%
  arrange(-nas_present)%>%
  unique()

#a list of duplicated date records
non_unique_dates <- city_temp_clean %>%
  group_by(Region, State, City, Country, parsedDate) %>%
  summarise(count_of_obs=n()) %>%
  filter(count_of_obs>1)%>%
  arrange(-count_of_obs)
          The cleaned dataset exhibits a large discrepancy in total records when considered by region. The table below summarizes this phenomenon and provides some general insight into regional temperature behavior.
          Notice that North America contains greater than 1.4 million records - more than all other regions combined. This discrepancy makes regional comparison statistically difficult. In addition, there is a range in variance by region. Africa is the warmest region with a mean temperature of 19.89 C, but its standard deviation is only about half of that for North America. Warmer regions or regions that encompass fewer discrete climatic regions exhibit less variability than seasonal areas.
summary_statistics_by_region <- city_temp_clean %>%
  group_by(Region) %>%
  dplyr::summarize(mean=mean(AvgTempCelsius, na.rm=TRUE), sd=sd(AvgTempCelsius,na.rm=TRUE),median=median(AvgTempCelsius, na.rm = TRUE), Record_count = n())%>%
  as.data.frame() 

  summary_statistics_by_region <-  rename(summary_statistics_by_region, "Mean Temperature (C)" = mean, "Median" = median, "Standard Deviation" = sd, "Number of Records" = Record_count)
  
summ_table_presentation <- summary_statistics_by_region%>% 
    kbl(caption = "Summary Statistics by Region", ) %>%
    kable_styling( font_size = 16)%>%
  kable_paper("hover", full_width = F)%>%
    kable_material_dark()
    
summ_table_presentation
Summary Statistics by Region
Region Mean Temperature (C) Standard Deviation Median Number of Records
Africa 19.88590 5.802537 20.1 54792
Asia 19.06741 11.727034 23.2 182640
Australia/South Pacific 16.46013 5.355089 16.4 45660
Europe 10.76569 8.230676 11.2 302376
Middle East 23.12594 9.564097 23.4 109583
North America 13.71096 10.592926 15.0 1415309
South/Central America & Carribean 17.27409 7.106319 16.9 54792
          The linear model of annual temperature over time shows a statistically significant relationship. It is graphed below. With a p-value of o.00097 and an R value of 0.62, this fit describes approximately 62% of the variation in annual temperature, and tracks a noticeable uptick in the average of over one degree between 1995 and 2019. At an alpha of 0.05 we have sufficient evidence to reject the hypothesis that there is no relationship between temperature and the passage of time during this period.
city_temp_clean%>%
  group_by(Year)%>%
  dplyr::summarize(
    "Annual Average Temperature (C)"=mean(AvgTempCelsius, na.rm=TRUE),SD=sd(AvgTempCelsius, na.rm=TRUE)
  )%>%
ggscatter(x = "Year", y = "Annual Average Temperature (C)", 
          add = "reg.line", 
          conf.int = TRUE, 
          add.params = list(color = "red",
                            fill = "lightblue"),
          color = "Annual Average Temperature (C)", 
          cor.coef = TRUE, cor.method = "pearson", 
          xlab = "Year", ylab = "Temperature in Celsius")+
          labs(title = "Linear Model of Annual Global Temperature", subtitle= "With Confidence Interval")+
          theme_bw()

3 Mumbai

3.0.1 Filtering data

Mumbai <- city_temp_clean %>%
  filter(City == "Bombay (Mumbai)")

3.1 Urbanization, Development and Temperature Increase: Mumbai, India

          There are important reasons to focus on cities when examining climate change and temperature change over time. As researchers have argued, high density, rapid population increase, and the urban development of cities (“urban heat islands”) make the impact of climate change more severe in cities (FPA, 2021; Lombrana and Dodge 2021). Mumbai (Bombay) India is the 9th most populated city (21 million) in the world and was selected as a case study for analysis because of its rapid growth in urbanization and industrialization over the past couple decades (https://ourworldindata.org/economic-growth). Traditional approaches to economic development (ie/ manufacturing) require a two-fold destruction of the environment: the development of land for industry (ie/factories) and the resulting pollution emitted by those industries. For this reason, analyzing manufacturing growth in a mega-city like Mumbai should be considered. The following research questions drive this case study:

1.Is there a significant positive correlation between time and temperature increase in Mumbai?
2.Is there a correlation between an increase in manufacturing growth (as indicated by US $) and temperature increase In Mumbai?

3.2 Time and temperature in Mumbai

For the first research question, hypotheses testing is as follows: Ho: There is no relationship between time (years) and temperature. Ha: There is a positive linear relationship between time (years) and temperature.

Although annual seasonal fluctuations in temperature are a critical factor in analyzing temperature, Mumbai is relatively consistent compared to other regions of the world (https://en.climate-data.org/asia/india/maharashtra/mumbai-29/). After creating a new data frame, a new variable (average yearly temperatures) was run against temperature. The output examines the relationship between time (Indep var) on temperature (Dep var).

3.2.1 Cleaning and regression with Avg Temp and time

MumbaiYearandTemp <-  Mumbai%>%
                      group_by(Year)%>%
                      dplyr::summarize("AvgTempCelsius"=mean(AvgTempCelsius, na.rm=TRUE))
      
  

summary(lm(AvgTempCelsius ~ Year, data = MumbaiYearandTemp))
## 
## Call:
## lm(formula = AvgTempCelsius ~ Year, data = MumbaiYearandTemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70343 -0.19792  0.03796  0.26524  0.47350 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -62.38500   18.30411  -3.408  0.00241 ** 
## Year          0.04496    0.00912   4.929 5.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3288 on 23 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 1 and 23 DF,  p-value: 5.558e-05

3.2.2 Interpretation of the regression output

For each year that goes by, there is a .045 increase in average temperature on average (C).The adjusted R-squared value is .493 which means this model accounts for roughly 50% of variation in temperature. The slope of the fit is positive and the p-value (at 0.00005558) means that this relationship is statistically significant. There is a positive linear relationship between time and temperature when time is measured in years. It is worth noting that at the annual level, this fit is based on just a couple of dozen datapoints. A longer temperature trend would be needed to really prove that this trend is steady through time. This model is only commenting on the trend during this particular time slot.

#linear model of time (yearly) with temp for Mumbai
  
MumbaiYearandTemp %>% 
ggscatter(x = "Year", y = "AvgTempCelsius", 
          add = "reg.line", 
          conf.int = TRUE, 
          add.params = list(color = "red",
                            fill = "lightblue"),
          cor.coef = TRUE, cor.method = "pearson", 
          xlab = "Year", ylab = "Temperature in Celsius")+
          labs(title = "Linear Model of Temperatures in Mumbai", subtitle= "With Confidence Interval")+
          theme_bw()
## `geom_smooth()` using formula 'y ~ x'

## Manufacturing growth and temperature

For the second research question, hypotheses testing is as follows: Ho: There is no relationship between manufacturing growth (US $) and temperature. Ha: There is a positive relationship between manufacturing increase (US $) and temperature. The following chart examines the relationship between manufacturing growth (independent variable) and temperature (dependent variable).

#added a column of economic data from same original datasource (Manufacturing - in US dollars 1995-2019) 

MumbaiYearandTemp['ManufactureGrowth'] <- c(85661962758, 93800942935, 93848956738, 96788159277, 102000000000, 109000000000, 112000000000,
120000000000, 127000000000, 137000000000, 149000000000, 176000000000, 188000000000, 197000000000, 219000000000, 235000000000, 243000000000,
256000000000, 269000000000, 290000000000, 328000000000, 354000000000,
380000000000, 401000000000, 391000000000)

3.2.3 Regression analysis of Avg Temp of each year

summary(lm(AvgTempCelsius~ ManufactureGrowth, data = MumbaiYearandTemp))
## 
## Call:
## lm(formula = AvgTempCelsius ~ ManufactureGrowth, data = MumbaiYearandTemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59609 -0.23756  0.05875  0.21587  0.44682 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.712e+01  1.311e-01 206.835  < 2e-16 ***
## ManufactureGrowth 3.504e-12  5.704e-13   6.143 2.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2902 on 23 degrees of freedom
## Multiple R-squared:  0.6213, Adjusted R-squared:  0.6048 
## F-statistic: 37.74 on 1 and 23 DF,  p-value: 2.883e-06

3.2.4 Interpretation of Manufacturing Growth and Temperature

          Manufacturing growth in total USD value is correlated with rise in temperature. The relationship is statistically significant with a p-value of 0.00002883, and manufacturing growth appears to explain 60% of variation in temperature. In addition, when manufacturing growth increases by 100,000,000,000 results in a temperature increase of roughly 0.25 degrees Celsius, on average. This is discernible from the graph, and from the linear regression. As manufacturing growth spikes, the hypothesis is that localized temperature in the city increases. The direction of the two variables is positive and the p-value is close to zero which means that this relationship is statistically significant. Therefore, the null hypothesis can be rejected. Industrialization does appear to trend with temperature.
#plotting line between Manuf Growth and Avg Temp  

MumbaiYearandTemp %>% 
ggscatter(x = "ManufactureGrowth", y = "AvgTempCelsius", 
          add = "reg.line", 
          conf.int = TRUE, 
          add.params = list(color = "red",
                            fill = "lightblue"),
          color = "red", 
          cor.coef = TRUE, cor.method = "pearson", 
          xlab = "Manufacturing (Growth in US Dollars)", ylab = "Temperature (C)")+
          labs(title = "Linear Regression between Manufacturing Growth and Avg. Temperature (C)", subtitle= "With Confidence Interval")+
          theme_bw()
## `geom_smooth()` using formula 'y ~ x'

### Limitations and analysis considerations

As discussed previously, these data are limited for various reasons and any conclusions drawn from modeling must consider the complexity of climate change. While manufacturing seems to play a role in the increase of temperature in Mumbai, understanding this more fully requires knowledge of historical and regional characteristics, accurate data on a range of industries, and the role of economic sectors to name a few. Other variables from the dataset would have been interesting to examine such as natural resource depletion, consumption and energy use. However, missing data for many years would create challenges.

4 Beijing

library(pacman)
p_load(tidyverse)
p_load(VIM)
p_load(lubridate)

options(scipen = 999)

4.1 Introduction

          China is often cited as having severe air pollution. So much in fact that in 2014, the Chinese government declared a “War on Pollution”. This data on Beijing air quality was obtained through the UC Irvine Machine Learning Repository, and was collected by four nationally-controlled air quality monitoring sites throughout the city of Beijing, which were produced in part because of the clean air initiative. The data set contains hourly measurements of four major air pollutants, carbon monoxide, nitrogen dioxide, sulfur dioxide, and ozone. The pollutants are measured in micrograms per cubic meter.
df_1 <- read_csv("data/PRSA_Data_Tiantan_20130301-20170228.csv")

df_2 <- read_csv("data/PRSA_Data_Shunyi_20130301-20170228.csv")

df_3 <- read_csv("data/PRSA_Data_Dingling_20130301-20170228.csv")

df_4 <- read_csv("data/PRSA_Data_Dongsi_20130301-20170228.csv")

4.1.1 Importing Cleaning and Adding Pollutant Data

For the data to be comparable with the temperature data set, the hourly measurements were converted into daily averages.

df <- bind_rows(df_1, df_2, df_3, df_4)

df
df <- df %>% 
  mutate(date = make_date(year, month, day))
df_avgs <- df %>% 
  group_by(date) %>% 
  summarise(
    avg_co = mean(CO, na.rm = TRUE),
    avg_so2 = mean(SO2, na.rm =TRUE),
    avg_No2 = mean(NO2, na.rm =TRUE),
    avg_O3 = mean(O3, na.rm =TRUE)
  )
df_avgs
## # A tibble: 1,461 × 5
##    date       avg_co avg_so2 avg_No2 avg_O3
##    <date>      <dbl>   <dbl>   <dbl>  <dbl>
##  1 2013-03-01   350     5.84    19.2   71.4
##  2 2013-03-02   924.   25.2     53.3   34.2
##  3 2013-03-03  1688.   41.4     73     25.7
##  4 2013-03-04   738.   13.8     38.8   61.9
##  5 2013-03-05  1970.   62.7     96.6   84.5
##  6 2013-03-06  2637.   89.2    120.    56.5
##  7 2013-03-07  3505.   75.8    132.    55.3
##  8 2013-03-08  2684.   54.0    102.    97.6
##  9 2013-03-09  1186.   26.1     48.8   91.3
## 10 2013-03-10   725.   18.8     40.5   82.0
## # … with 1,451 more rows
summary(df_avgs)
##       date                avg_co          avg_so2           avg_No2       
##  Min.   :2013-03-01   Min.   : 194.8   Min.   :  1.208   Min.   :  8.594  
##  1st Qu.:2014-03-01   1st Qu.: 610.1   1st Qu.:  4.396   1st Qu.: 29.323  
##  Median :2015-03-01   Median : 931.2   Median :  8.684   Median : 38.758  
##  Mean   :2015-03-01   Mean   :1179.6   Mean   : 14.560   Mean   : 44.562  
##  3rd Qu.:2016-02-29   3rd Qu.:1420.8   3rd Qu.: 18.547   3rd Qu.: 54.208  
##  Max.   :2017-02-28   Max.   :7857.9   Max.   :127.053   Max.   :153.802  
##      avg_O3       
##  Min.   :  2.189  
##  1st Qu.: 29.918  
##  Median : 54.802  
##  Mean   : 59.375  
##  3rd Qu.: 82.776  
##  Max.   :180.781
beijing <- city_temp_clean %>% 
  filter(City == "Beijing")

beijing <- beijing %>% 
  filter(parsedDate >= "2013-03-01") 

beijing <- beijing %>% 
  filter(parsedDate <= "2017-02-28") 

colnames(beijing)[9] <- "date"
total <- merge(beijing, df_avgs, by = "date")

total

4.3 Performing Regressions on Temperature and Pollution Data

          Now that the data is in the proper format, the next step is to determine whether or not there is any correlation between average greenhouse gas pollutant concentration and average daily temperature. First, a simple linear regression model was run, with average temperature as the dependent variable, and each of the pollutant concentrations as the independent variables. This model generated an adjusted R squared value of 0.581, implying that roughly 58% of the data points are explained by this model. It also generated a p value of 0.00000000000000022, implying the results are significant, and likely not the result of random chance. This model is adequate, but can likely be improved.
model <- lm(AvgTemperature ~ avg_co + avg_so2 + avg_No2 + avg_O3, data = total)

coef(model)
##  (Intercept)       avg_co      avg_so2      avg_No2       avg_O3 
## 29.627620983 -0.002147985 -0.459513721  0.286439234  0.385010587
summary(model)
## 
## Call:
## lm(formula = AvgTemperature ~ avg_co + avg_so2 + avg_No2 + avg_O3, 
##     data = total)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.074  -8.674   0.575   8.702  43.060 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 29.6276210  1.2936175  22.903 < 0.0000000000000002 ***
## avg_co      -0.0021480  0.0007055  -3.044              0.00237 ** 
## avg_so2     -0.4595137  0.0286716 -16.027 < 0.0000000000000002 ***
## avg_No2      0.2864392  0.0309857   9.244 < 0.0000000000000002 ***
## avg_O3       0.3850106  0.0105364  36.541 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.82 on 1450 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.5822, Adjusted R-squared:  0.581 
## F-statistic: 505.1 on 4 and 1450 DF,  p-value: < 0.00000000000000022
          The next step is to perform a step wise regression, which will remove any independent variables that are not contributing strongly to the model. When the step wise model was run it generated the same regression equation with the same coefficients as the linear model. This, along with the p values, tells us that all of the pollutants are significant contributors to average daily temperature, but the model is not improved from the simple linear regression run above.
model2 <- step(lm(AvgTemperature ~ avg_co + avg_so2 + avg_No2 + avg_O3, data = total), direction = "both")
## Start:  AIC=7429.19
## AvgTemperature ~ avg_co + avg_so2 + avg_No2 + avg_O3
## 
##           Df Sum of Sq    RSS    AIC
## <none>                 238437 7429.2
## - avg_co   1      1524 239961 7436.5
## - avg_No2  1     14052 252489 7510.5
## - avg_so2  1     42237 280674 7664.5
## - avg_O3   1    219566 458003 8377.0
summary(model2)
## 
## Call:
## lm(formula = AvgTemperature ~ avg_co + avg_so2 + avg_No2 + avg_O3, 
##     data = total)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.074  -8.674   0.575   8.702  43.060 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 29.6276210  1.2936175  22.903 < 0.0000000000000002 ***
## avg_co      -0.0021480  0.0007055  -3.044              0.00237 ** 
## avg_so2     -0.4595137  0.0286716 -16.027 < 0.0000000000000002 ***
## avg_No2      0.2864392  0.0309857   9.244 < 0.0000000000000002 ***
## avg_O3       0.3850106  0.0105364  36.541 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.82 on 1450 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.5822, Adjusted R-squared:  0.581 
## F-statistic: 505.1 on 4 and 1450 DF,  p-value: < 0.00000000000000022
          Looking at the graphs above, it can be seen that there is a certain amount of seasonal fluctuation in the concentration of the pollutants. For this reason, it was determined that a model should be produced that accounted for seasonality as a factor. Using the code below, a “season” variable was added to the dataframe depending on the “Month” column.
map_to_season_north <- function(month_int){
  
  months_df <- data.frame(month_num = c(1,2,3,4,5,6,7,8,9,10,11,12), 
                          season=c('winter', 'winter', 'spring', 'spring', 'spring', 'summer', 'summer', 'summer', 'fall', 'fall', 'fall', 'winter'))
  
  result <- months_df[months_df$month_num == month_int,]$season
  return(result)
}

season <- unlist(lapply(month(total$Month), map_to_season_north))


total_w_season <- cbind(total, season)

total_w_season
          Once seasonality was added to the data frame, it becomes possible to control for seasonality as a factor. When the model is run, it generates an adjusted R squared value of 0.823, implying that roughly 83% of data points can be explained by the model. This is a significant improvement over the simple linear model that did not account for seasonality. It also generated a p value of 0.00000000000000022. From these results, we can conclude with high confidence that there is a high correlation between air pollution concentration and temperature.
model3 <- lm(AvgTemperature ~ avg_co + avg_so2 + avg_No2 + avg_O3 + as.factor(season), data = total_w_season)


summary(model3)
## 
## Call:
## lm(formula = AvgTemperature ~ avg_co + avg_so2 + avg_No2 + avg_O3 + 
##     as.factor(season), data = total_w_season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.798  -4.849   0.434   5.556  24.026 
## 
## Coefficients:
##                            Estimate  Std. Error t value             Pr(>|t|)
## (Intercept)              43.1958690   0.9399281  45.957 < 0.0000000000000002
## avg_co                   -0.0012077   0.0004892  -2.469             0.013672
## avg_so2                  -0.1053013   0.0208949  -5.040       0.000000525406
## avg_No2                   0.1340333   0.0215565   6.218       0.000000000658
## avg_O3                    0.2076950   0.0089015  23.333 < 0.0000000000000002
## as.factor(season)spring  -2.4660783   0.7191434  -3.429             0.000622
## as.factor(season)summer  13.8525639   0.7833078  17.685 < 0.0000000000000002
## as.factor(season)winter -21.8083026   0.7027941 -31.031 < 0.0000000000000002
##                            
## (Intercept)             ***
## avg_co                  *  
## avg_so2                 ***
## avg_No2                 ***
## avg_O3                  ***
## as.factor(season)spring ***
## as.factor(season)summer ***
## as.factor(season)winter ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.332 on 1447 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.824,  Adjusted R-squared:  0.8231 
## F-statistic: 967.6 on 7 and 1447 DF,  p-value: < 0.00000000000000022

5 New Zealand

5.1 Introduction

          New Zealand’s temperature data is from a single source - the city of Auckland. Auckland is located on the North Island of New Zealand. This temperature record will be used as a proxy for country-wide temperature for the purposes of this analysis. Below is a plot of annual average temperature, with linear fit. There is a clear increasing trend in temperature at the annual level during this period, similar to the global trend observed.
          New Zealand is famous for its natural beauty and glaciers, and the country is also well known for a strong role in the development of R and data analytics. It is not particularly surprising that the government maintains a robust data repository of information relevant to the country. This section delves into New Zealand-specific data.
city_temp_clean%>%
  group_by(Year)%>%
  filter(Country=="New Zealand")%>%
    dplyr::summarize(
    Annual_Avg=mean(AvgTempCelsius, na.rm=TRUE),SD=sd(AvgTempCelsius, na.rm=TRUE)) %>%
    ggplot(aes(Year,Annual_Avg)) +
      geom_line()+
      geom_smooth(method='lm',se=F)+
      #geom_abline(intercept=euro_intercept, slope=euro_slope, color='red', name="European Trend Line") +
      xlim(1995, 2019)+
      theme_bw()+
      labs(y="Annual Average Temperature (Degrees Celsius)", title = "Average Temperature in NZ With Best Fit Line", subtitle = "1995-2019")

## Data Preparation

          Initial data preparation tasks include assigning a dataframe of New Zealand (NZ)-specific records to speed up analysis. Preparation also includes adding a categorical variable for southern hemisphere seasonality, and preparing and joining glacial ice volume data which will be introduced further below. The following script prepares the new dataframe.
# new dataframe for New Zealand Data
nz_temp <- city_temp_clean%>%filter(Country=='New Zealand')
#this function takes month number as the input and assigns a categorical variable for season depending on the result. The output is tuned to the southern hemisphere. It will be used to add season information for the monthly record.
map_to_season_south <- function(month_int){
  
  months_df <- data.frame(month_num = c(1,2,3,4,5,6,7,8,9,10,11,12), 
                          season=c('summer', 'summer', 'fall', 'fall', 'fall', 'winter', 'winter', 'winter', 'spring', 'spring', 'spring', 'summer'))
  
  result <- months_df[months_df$month_num == month_int,]$season
  return(result)
}

map_to_season_south(1)
# assigning a dataframe of monthly temperatures, with season information and an additional date object that denotes unique month-year combo.

nz_monthly_temp <- nz_temp%>%
  group_by(Year=year(parsedDate),Month=month(parsedDate)) %>% 
  summarise(AvgTempCelsius=mean(AvgTempCelsius, na.rm = TRUE))%>%
  mutate(firstofmonth = ymd(paste0(Year,'-' , Month, '-1')))

nz_seasons <- unlist(lapply(month(nz_monthly_temp$firstofmonth), map_to_season_south))

nz_monthly_temp <- cbind(nz_monthly_temp, nz_seasons)

names(nz_monthly_temp) <- c('Year','Month','AvgTempCelsius','firstofmonth','season')

head(nz_monthly_temp)
## # A tibble: 6 × 5
## # Groups:   Year [1]
##    Year Month AvgTempCelsius firstofmonth season
##   <dbl> <dbl>          <dbl> <date>       <chr> 
## 1  1995     1           19.5 1995-01-01   summer
## 2  1995     2           20.1 1995-02-01   summer
## 3  1995     3           18.7 1995-03-01   fall  
## 4  1995     4           17.9 1995-04-01   fall  
## 5  1995     5           14.2 1995-05-01   fall  
## 6  1995     6           11.9 1995-06-01   winter
          The New Zealand Ministry for the Environment Published an annual record of ice volume from 1995 through 2020. This record pulled from NOAA-satellite observations to generate an estimate of country-wide glacial ice volume. We applied a seasonal multiplier and a small amount of randomness to introduce variability in an interpolated monthly record of ice volume over the period, and joined it to the monthly temperature record for Auckland.
#annual ice file
nz_ice<- read.csv("data/annual_glacier_volume.csv")
#expand to dataframe of monthly data, with columns for annual total as repeating values per year for monthly volume (/12). 
nz_monthly_ice <- nz_ice%>%mutate(month = 1) %>%
  complete(year, month = 1:12) %>%
  fill(ice_volume_km.3) 
  #mutate_at(vars(ice_volume_km.3), ~./12)

#character vector of seasons
nz_seasons <- unlist(lapply(nz_monthly_ice$month, map_to_season_south))

#add character vector of seasons to monthly ice dataframe
nz_monthly_ice <- cbind(nz_monthly_ice, nz_seasons)

                                                   
nz_monthly_ice <- nz_monthly_ice%>%mutate(firstofmonth = ymd(paste0(year,'-' , month, '-1')))
names(nz_monthly_ice) <- c("year", "month", "monthly_ice", "nz_seasons", "firstofmonth")
          A process of imputing from annual data to generate monthly ice records was undertaken. This was accomplished by setting a seasonal variance around the annual mean of 5 percent, meaning a 5% increase in annual mean volume was added in winter, and 5 percent was subtracted in summer. This was metered out by month, and then a small amount of variation was randomly added to each observation to simulate seasonal noise. Note: the researchers are aware that this data is estimated at best. The goal was to generate datapoints for ice so that the overall trend could be examined in relation to temperature. There are myriad approaches to imputing data, and none are perfect.
#dataframe of monthly seasonal multipliers for ice volume
temp_monthly = data.frame(list(1:12, c(.95, .95, .975, 1, 1.025, 1.05, 1.05, 1.025, 1, .975, .95, .925)))

names(temp_monthly)= c("month", "delta")
#joining to monthly ice record, and creating new monthly ice values using delta multipliers 
nz_monthly_ice <- left_join(nz_monthly_ice, temp_monthly, by="month")
nz_monthly_ice <- nz_monthly_ice %>% mutate(monthly_ice = delta* monthly_ice+runif(300,-1,1))
# redefining to selected columns, converting to numeric to generate firstofmonth variable for join to temp data
nz_monthly_ice <- nz_monthly_ice %>% select(year,month,monthly_ice)
nz_monthly_ice$year <- as.numeric(nz_monthly_ice$year)
nz_monthly_ice$month <- as.numeric(nz_monthly_ice$month)

nz_monthly_ice$firstofmonth = ymd(paste0(nz_monthly_ice$year,'-' , nz_monthly_ice$month, '-1'))
#joining ice volume and temp tada
nz_monthly_temp <- left_join(nz_monthly_temp, nz_monthly_ice, by="firstofmonth")
nz_monthly_temp%>%select(firstofmonth,season,AvgTempCelsius,monthly_ice)

# set seed for random number generator to add noise to glacial data
set.seed(1000)
nz_monthly_temp$monthly_ice <- nz_monthly_temp$monthly_ice + runif(1, -1, 1)

5.2 Examining the Temperature - Ice Relationship

          The volume of ice has declined precipitously, with a 38% reduction in glacial volume over the period. This is a profoundly concerning result for a number of reasons, chief among them being that these glaciers took thousands or tens of thousands of years to accumulate the volume that was erased in a period of a couple of decades. The graph below depicts the trend of glacial ice volume for the imputed monthly record generated above.
          At first glance the rate of change for glacial ice far outpaces the modest overall increase in NZ temperature observed over the same period. Is it possible that temperature is a statistically significant predictor of the ice decline?
nz_monthly_temp%>%
    ggplot(aes(firstofmonth,monthly_ice)) +
    geom_line()+
    geom_smooth(method='lm',se=F)+
    #geom_abline(intercept=euro_intercept, slope=euro_slope, color='red', name="European Trend Line") +
    #xlim(1995, 2019)+
    theme_bw()+
    labs(y="Monthly Average Ice Volume (KM^3)",x="Month", title = "Monthly Ice in NZ With Best Fit Line", subtitle = "1995-2019")

          The next step is to use an appropriate correlation test to look for a relationship between temperature and ice volume. These two variables are assumed to be normal due to the volume of observations. As a result, the Pearson product-moment correlation test is appropriate. The null hypothesis is that there is no relationship between temperature and ice volume in NZ. The significance level for this test will be 0.01. Below is a scatter plot of temperature vs. ice observations. Ice is on the y axis. The Pearson test results are displayed on the chart. The Pearson test coefficient (R) equals -0.31, indicating a negative relationship between temperature and ice. The p-value of 0.00000004 is well below the significance level of .01. These results indicate that there is a negative relationship between temperature and ice, and that this relationship is significant. As temperature rises, glacial ice recedes.
ggscatter(nz_monthly_temp, x = "AvgTempCelsius", y = "monthly_ice",
          color = "red", cor.coef = TRUE, 
          cor.method = "pearson",
          xlab = "Temperature (C)", ylab = "Ice (km^3")

5.3 Examining and Refining a Linear Model of Monthly Temperature in New Zealand

          The linear fit for monthly temperature data is graphed and summarized below. Fitting and evaluating a model on monthly average date instead of daily observations improves performance and clears up the picture somewhat. The data are widely dispersed around the theoretical line of best fit in a regular pattern caused by season, and therefore the adjusted r-squared is still essentially zero. The line graph of monthly temperature below makes this cyclical effect abundantly clear.
# monthly temperatures in NZ, adding a unique column "first of month" that denotes each unique month and year
 

nz_monthly_temp %>%
  lm(AvgTempCelsius~firstofmonth, data= .) %>%
  summary()
## 
## Call:
## lm(formula = AvgTempCelsius ~ firstofmonth, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8506 -2.5745 -0.2731  3.0434  6.5285 
## 
## Coefficients:
##                 Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept)  15.10342233  0.97309104  15.521 <0.0000000000000002 ***
## firstofmonth  0.00002892  0.00006984   0.414               0.679    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.189 on 298 degrees of freedom
## Multiple R-squared:  0.000575,   Adjusted R-squared:  -0.002779 
## F-statistic: 0.1714 on 1 and 298 DF,  p-value: 0.6791
nz_monthly_temp%>%
  lm(AvgTempCelsius~firstofmonth, data= .) %>%
  ggplot(aes(firstofmonth,AvgTempCelsius)) +
      geom_line()+
      geom_smooth(method='lm',se=F)+
      #geom_abline(intercept=euro_intercept, slope=euro_slope, color='red', name="European Trend Line") +
      #ylim(9, 16)+
      theme_bw()+
      labs(y="Average Temperature (Degrees Celsius)", x="Month", title = "Average Monthly Temperature in NZ With Best Fit Line", subtitle = "1995-2019")
## `geom_smooth()` using formula 'y ~ x'

The adjusted R-squared for a linear model of temperature ~ month across the record is 0.002779. This indicates essentially no relationship between temperature and time, with a p-value of 0.679. Below, the residual graphs for this fit (using the term loosely) are displayed.

nz_monthly_temp %>%
  lm(AvgTempCelsius~firstofmonth, data= .) %>% plot()

          The residual plots confirm that variance is weakening the quality of this fit. The QQ plot demonstrates that the standardized residuals fall well away from the theorized quantiles at either end of the distribution, another indicator of variance caused by seasonality.

5.4 Accounting for Seasonality in the New Zealand Temperature Record

          There is significant variance in the monthly temperature record, particularly spotlighted in the residuals vs. fitted chart above. The linear fit visibly displays a very very slight upward trend, but there is no statistical reason to trust it. Seasonal variability introduces too much variance. The following exercise attempts one method for accounting for this variability, by incorporating seasonality into the an improved linear model as a dummy variable. The restricted model is the linear model of temperature by month summarized above, while the full model incorporates seasonality. Note: seasonal shifts are different each year - they can start early, end late, or even reverse periodically. For the purpose of this investigation, this randomness is ignored. The categorical imputation function map_to_season_south assigns season by month - December, January, February being summer, March, April, June being autumn, and so forth. This is a very high-level approach to assigning seasonality in the hopes of reducing variance in the model.
restricted <- nz_monthly_temp %>%
  
    lm(AvgTempCelsius~firstofmonth, data= .)

full <- nz_monthly_temp %>%
  
    lm(AvgTempCelsius~firstofmonth+as.factor(season), data= .)


  nz_monthly_temp %>%
    lm(AvgTempCelsius~firstofmonth, data= .) %>%
    summary()
## 
## Call:
## lm(formula = AvgTempCelsius ~ firstofmonth, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8506 -2.5745 -0.2731  3.0434  6.5285 
## 
## Coefficients:
##                 Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept)  15.10342233  0.97309104  15.521 <0.0000000000000002 ***
## firstofmonth  0.00002892  0.00006984   0.414               0.679    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.189 on 298 degrees of freedom
## Multiple R-squared:  0.000575,   Adjusted R-squared:  -0.002779 
## F-statistic: 0.1714 on 1 and 298 DF,  p-value: 0.6791
  nz_monthly_temp %>%
    lm(AvgTempCelsius~firstofmonth+as.factor(season), data= .) %>%
    summary()
## 
## Call:
## lm(formula = AvgTempCelsius ~ firstofmonth + as.factor(season), 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7181 -0.8873 -0.0427  0.9653  3.5317 
## 
## Coefficients:
##                            Estimate  Std. Error t value             Pr(>|t|)
## (Intercept)             15.88247064  0.45600982  34.829 < 0.0000000000000002
## firstofmonth             0.00004380  0.00003126   1.401                0.162
## as.factor(season)spring -1.96062316  0.23302587  -8.414  0.00000000000000173
## as.factor(season)summer  2.91907799  0.23295727  12.531 < 0.0000000000000002
## as.factor(season)winter -4.88876814  0.23297276 -20.984 < 0.0000000000000002
##                            
## (Intercept)             ***
## firstofmonth               
## as.factor(season)spring ***
## as.factor(season)summer ***
## as.factor(season)winter ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.427 on 295 degrees of freedom
## Multiple R-squared:  0.802,  Adjusted R-squared:  0.7993 
## F-statistic: 298.7 on 4 and 295 DF,  p-value: < 0.00000000000000022
          The Restricted model in this case is the original model without seasonality. The full model adds seasonality as a variable. When the two models are compared, the results are stark. With an adjusted R-squared of 0.7993, the model accounting for seasonality is predicting approximately 79.9% of the variation in temperature, as compared to an adjusted R-squared of just under 0% prediction value in the restricted model as was previously covered. The p-value for the full model is essentially zero, indicating a statistically significant fit for this model to the data.
          Accounting for seasonality generates a far stronger fit for the model by parsing the swings in temperature brought on by the changing seasons. A simple linear regression is hopeless for describing a meaningful trend for the overall data. This approach reveals that seasonality has a dominant effect on monthly temperature on an annual basis. Transforming the line by season, even by a very rough and admittedly inaccurate assignment of season by month (seasons rarely fall perfectly into three month buckets, so the season category is really an approximation), vastly improves the predictive power of the model. The model has treated fall as the reference, comparing the other three seasons to it.
plot(full)

          An examination of the residuals plots for the seasonality model reveals how seasonality is factored in. The residuals vs. fitted graph diplays how the points are grouped by season in the full model, and fitted from there - reducing variance markedly compared to the residual vs. fitted graph from the restricted model. The QQ plot also reveals how residuals stay far closer to the theorized line. This impact is seen throughout - the variance described by the linear fit is smaller when seasonality is treated as a variable.

5.5 Confirming efficacy of the seasonality model through ANOVA

          To wrap up this seasonal exploration of temperature, I conducted a likelihood ratio ANOVA test on the monthly models for New Zealand temperature with and without accounting for seasonality, and included a rough graph of the two models overlayed on the data with some numeric characteristics displayed. My null hypothesis for the likelihood ratio test is that after accounting for seasonality, time is not predictive of average monthly temperature. My alternative hypothesis is that after accounting for seasonality, there is a meaningful change in monthly average temperature over time. The P-value for this test is essentially zero at 2.2e-16, leading me to reject the null hypothesis at an alpha of 0.01 and conclude that the seasonality model does predict temperature through time to a greater extent than the restricted model. The much higher R-squared of 0.7993 further strengthens this case.
anova(restricted, full, test="LRT")
## Analysis of Variance Table
## 
## Model 1: AvgTempCelsius ~ firstofmonth
## Model 2: AvgTempCelsius ~ firstofmonth + as.factor(season)
##   Res.Df     RSS Df Sum of Sq              Pr(>Chi)    
## 1    298 3030.12                                       
## 2    295  600.34  3    2429.8 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6 Conclusions

          In recent history, the conversation about climate change often points towards production and greenhouse gas emissions as contributing factors to increasing temperatures. Throughout this analysis, three separate cities have been investigated to determine whether or not some of these factors have played a contributing role in rising temperatures in a localized city environment. When investigating Dubai, India, temperature changes were examined in relation to manufacturing growth. From the regression performed above, it can be seen that there is a correlation between manufacturing growth and temperature rise, with an adjusted R squared value of 0.605 and a p-value of 0.00002883, we can say these results are significant and are likely not the result of random chance.
          When this analysis looked at Beijing, where temperature was investigated in the background of greenhouse gas concentration, there were significant results once again. Four major pollutants were considered, carbon monoxide, sulfur dioxide, nitrogen dioxide, and ozone. When a linear model was run with these four pollutants as independent variables with seasonality as a factor, it produces an adjusted R squared value of 0.823 and a p-value of 0.00000000000000022. With this information in mind, it can be concluded that the concentration of airborne pollutants is likely to have an effect on average temperature.
          Lastly, this analysis looked to determine if these rising temperatures, which are likely, in part, the effect of increased production and air pollution, had any effect on the environment. In Auckland, New Zealand, country-wide glacial volume is tracked and investigated. When a model is run with temperature as the independent variable and glacial volume as the dependent variable, and seasonality is included as a factor, it generates an adjusted R squared of 0.799 with a p-value of essentially 0, concluding that increased temperatures likely had an effect on glacial volumes.
          With the results of this analysis in mind, it is obvious that now more than ever, we as a society need to incentivize clean green production and reduce air pollution, or else it is likely we will continue to see dramatic levels of glacial ice loss throughout the world.

7 References

8 APPENDIX

#lm for NZ ANNUAL temp 
nz_annual_lm <- lm( nz_temp$AvgTempCelsius~nz_temp$parsedDate)
summary(nz_annual_lm)
## 
## Call:
## lm(formula = nz_temp$AvgTempCelsius ~ nz_temp$parsedDate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4136  -2.6826  -0.1062   2.9014   8.6922 
## 
## Coefficients:
##                       Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept)        15.09429858  0.19983710  75.533 <0.0000000000000002 ***
## nz_temp$parsedDate  0.00002758  0.00001433   1.925              0.0543 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.598 on 9071 degrees of freedom
##   (59 observations deleted due to missingness)
## Multiple R-squared:  0.0004082,  Adjusted R-squared:  0.000298 
## F-statistic: 3.704 on 1 and 9071 DF,  p-value: 0.0543
plot(nz_annual_lm)